Abstract

Recent work in metric learning has significantly improved the state-of-the-art in k-nearest
neighbor classification. Support vector machines (SVM), particularly with RBF kernels, are
amongst the most popular classification algorithms that use distance metrics to compare
examples. This paper provides an empirical analysis of the efficacy of three of the most
popular Mahalanobis metric learning algorithms as pre-processing for SVM training. We
show that none of these algorithms generates metrics that lead to particularly satisfying
improvements for SVM-RBF classification. As a remedy we introduce support vector metric
learning (SVML), a novel algorithm that seamlessly combines the learning of a Mahalanobis
metric with the training of the RBF-SVM parameters. We demonstrate the capabilities of
SVML on nine benchmark data sets of varying sizes and difficulties. In our study, SVML
outperforms all alternative state-of-the-art metric learning algorithms in terms of accuracy
and establishes itself as a serious alternative to the standard Euclidean metric with model
selection by cross validation.

Keywords: metric learning, distance learning, support vector machines, semi-definite programming, 
Mahalanobis distance 



1. Introduction 



Many machine learning algorithms, such as k-nearest neighbors (kNN) (Cover and Hart, 1967),
k-means (Lloyd, 1982) and support vector machines (SVMs) with shift-invariant kernels,
require a distance metric to compare instances. These algorithms rely on the assumption that
semantically similar inputs are close, whereas semantically dissimilar inputs are far away.
Traditionally, the most commonly used distance metrics are uninformed norms, like the
Euclidean distance. In many cases, such uninformed norms are sub-optimal. To illustrate this
point, imagine a scenario where two researchers want to classify the same data set of facial
images. The first one classifies people by age, the second by gender. Clearly, two images
that are similar according to the first researcher's setting might be dissimilar according
to the second's.

Uninformed norms ignore two important contextual components of most machine learn- 
ing applications. First, in supervised learning the data is accompanied by labels which 
essentially encode the semantic definition of similarity. Second, the user knows which ma- 
chine learning algorithm will be used. Ideally, the distance metric should be tailored to the 
particular setting at hand, incorporating both of these considerations. 



A generalization of the Euclidean distance is the Mahalanobis distance (Mahalanobis, 1936).
Recent years have witnessed a surge of innovation on Mahalanobis pseudo-metric learning
(Davis et al., 2007; Globerson and Roweis, 2005; Goldberger et al., 2005; Shental et al.,
2002; Weinberger et al., 2006). Although these algorithms use different methodologies, the
common theme is moving similar inputs closer and dissimilar inputs further away, where
similarity is generally defined through class membership. This transformation can be learned
through convex optimization with pairwise constraints (Davis et al., 2007; Weinberger et al.,
2006), gradient descent with soft neighborhood assignments (Goldberger et al., 2005), or
spectral methods based on second-order statistics (Shental et al., 2002).



Typically, the Mahalanobis metric learning algorithms are used in a two-step approach.
First the metric is learned, then it is used for training the classifier or clustering
algorithm of choice. The resulting distances are semantically more meaningful than the plain
Euclidean distance as they reflect the label information. This makes them particularly suited
for the k-nearest neighbor rule, leading to large improvements in classification error (Davis
et al., 2007; Globerson and Roweis, 2005; Goldberger et al., 2005; Shental et al., 2002;
Weinberger et al., 2006). In fact, several algorithms explicitly mimic the k-NN rule and
minimize a surrogate loss function of the corresponding leave-one-out classification error on
the training set (Goldberger et al., 2005; Weinberger et al., 2006).

Although the k-nearest neighbor rule can be a powerful classifier, especially in settings
with many classes, it comes with certain limitations. For example, the entire training data
needs to be stored and processed during test time. Also, in settings with fewer classes
(especially binary) it is generally outperformed by Support Vector Machines (Cortes and
Vapnik, 1995). Because of their high reliability as out-of-the-box classifiers, SVMs have
become one of the quintessential classification algorithms in many areas of science and
beyond. An important part of using SVMs is the right choice of kernel. The kernel function
$k(x_i, x_j)$ encodes the similarity between two input vectors $x_i$ and $x_j$. There are many
possible choices for such a kernel function. One of the most commonly used kernels is the
Radial Basis Function (RBF) kernel (Schölkopf and Smola, 2002), which itself relies on a
distance metric.

This paper considers metric learning for support vector machines. As a first contribution,
we review and investigate several recently published kNN metric learning algorithms for
use with SVMs with RBF kernels. We demonstrate empirically that these approaches do
not reliably improve SVM classification results up to statistical significance. As a second
contribution, we derive a novel metric learning algorithm that specifically incorporates the
SVM loss function during training. Here, we learn the metric to minimize the validation
error of the SVM prediction at the same time that we train the SVM. This is in contrast
to the two-step approach of first learning a metric and then training the SVM classifier
with the resulting kernel. This algorithm, which we refer to as Support Vector Metric
Learning (SVML), is particularly useful for two reasons. First, it achieves state-of-the-art
classification results and clearly outperforms other metric learning algorithms that are
not explicitly geared towards SVM classification. Second, it provides researchers outside of
the machine-learning community a convenient way to automatically pre-process their data
before applying SVMs.

This paper is organized as follows. In Section 2 we introduce necessary notation and
review some background on SVMs. In Section 3 we introduce several recently published
metric learning algorithms and report results for SVM-RBF classification. In Section 4
we derive the SVML algorithm and some interesting variations. In Section 5 we evaluate
SVML on nine publicly available data sets featuring a multitude of different data types and
learning tasks. We discuss related work in Section 6 and conclude in Section 7.



2. Support Vector Machines 



Let the training data consist of input vectors $\{x_1, \dots, x_n\} \in \mathbb{R}^d$ with
corresponding discrete class labels $\{y_1, \dots, y_n\} \in \{+1, -1\}$. Although our framework
can easily be applied in a multi-class setting, for the sake of simplicity we focus on binary
scenarios, restricting $y_i$ to two classes.

There are several reasons why SVMs are particularly popular classifiers. First, they
are linear classifiers that involve a quadratic minimization problem, which is convex and
guarantees perfect reproducibility. Furthermore, the maximum margin philosophy leads to
reliably good generalization error rates (Vapnik, 1998). But perhaps most importantly, the
kernel-trick (Schölkopf and Smola, 2002) allows SVMs to generate highly non-linear decision
boundaries with low computational overhead. More explicitly, the kernel-trick maps the
input vectors $x_i$ implicitly into a higher (possibly infinite) dimensional feature space with
a non-linear transformation $\phi : \mathbb{R}^d \to \mathcal{H}$. Training a linear classifier
directly in this high dimensional feature space $\mathcal{H}$ would be computationally infeasible
if the vectors $\phi(x_i)$ were accessed explicitly. However, SVMs can be trained completely in
terms of inner-products between input vectors. With careful selection of $\phi(\cdot)$, the
inner-product $\phi(x_i)^\top \phi(x_j)$ can be computed efficiently even if computation of the
mapping $\phi(\cdot)$ itself is infeasible. Let the kernel function be
$k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ and the $n \times n$ kernel matrix be $K_{ij} = k(x_i, x_j)$.
The optimization problem of SVM training can be expressed entirely in terms of the kernel
matrix $K$. For the sake of brevity, we omit the derivation and refer the interested reader
to one of many detailed descriptions thereof (Schölkopf and Smola, 2002). The resulting
classification rule of a test point $x_t$ becomes



$$h(x_t) = \mathrm{sign}\Big(\sum_{j=1}^{n} \alpha_j y_j k(x_j, x_t) + b\Big), \tag{1}$$



where $b$ is the offset of the separating hyperplane and $\alpha_1, \dots, \alpha_n$ are the dual
variables corresponding to the inputs $x_1, \dots, x_n$. In the case of the hard-margin SVM, the
parameters $\alpha_i$ are learned with the following quadratic optimization problem

$$\max_{\alpha_1, \dots, \alpha_n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$$

$$\text{subject to:} \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \quad \text{and} \quad \alpha_i \ge 0. \tag{2}$$

The optimization problem (2) ensures that all inputs $x_i$ with label $y_i = -1$ are on one side
of the hyperplane, and those with label $y_j = +1$ are on the other. These hard constraints
might not always be feasible, or in the interest of minimizing the generalization error (e.g.
in the case of noisy data). Relaxing the constraints can be performed simply by altering
the kernel matrix to

$$K \leftarrow K + \frac{1}{C} I^{n \times n}. \tag{3}$$

Solving (2) with the kernel matrix (3) is equivalent to a squared penalty on the violations of
the separating hyperplane (Cortes and Vapnik, 1995). This formulation requires no explicit
slack variables in the optimization problem and therefore simplifies the derivations of the
following sections.
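The classification rule (1) and the kernel modification (3) can be sketched in a few lines. This is a minimal illustration only: it assumes the dual variables and the offset have already been obtained from some SVM solver, and the helper names, the linear toy kernel and the numeric values are hypothetical.

```python
def svm_decision(alphas, labels, b, kernel, train_points, x_test):
    """Evaluate h(x_t) = sign(sum_j alpha_j * y_j * k(x_j, x_t) + b), as in (1)."""
    score = sum(a * y * kernel(x, x_test)
                for a, y, x in zip(alphas, labels, train_points)) + b
    return 1 if score >= 0 else -1


def relax_kernel(K, C):
    """Return K + (1/C) * I, the kernel modification in (3)."""
    n = len(K)
    return [[K[i][j] + (1.0 / C if i == j else 0.0) for j in range(n)]
            for i in range(n)]


# Toy usage with a linear kernel (an illustrative choice, not the RBF kernel).
linear = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [+1, -1]
alphas = [0.5, 0.5]  # hypothetical dual variables
prediction = svm_decision(alphas, y, 0.0, linear, X, [2.0, 1.0])
K_relaxed = relax_kernel([[1.0, -1.0], [-1.0, 1.0]], C=10.0)
```

Only the diagonal of the kernel matrix changes under (3), which is why no explicit slack variables appear in the solver.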



2.1 RBF Kernel 

There are many different kernel functions that are suitable for SVMs. In fact, any function
$k(\cdot, \cdot)$ is a well-defined kernel as long as it is positive semi-definite (Schölkopf and
Smola, 2002). The Radial Basis Function (RBF) kernel is defined as follows:

$$k(x_i, x_j) = e^{-d^2(x_i, x_j)}, \tag{4}$$

where $d(\cdot, \cdot)$ is a dissimilarity measure that must ensure positive semi-definiteness of
$k(\cdot, \cdot)$. The most common choice is the re-scaled squared Euclidean distance, defined as

$$d^2(x_i, x_j) = \frac{1}{\sigma^2} (x_i - x_j)^\top (x_i - x_j), \tag{5}$$

with kernel width $\sigma > 0$. The RBF kernel is one of the most popular kernels and yields
reliably good classification results. Also, with careful selection of $C$, SVMs with RBF
kernels have been shown to be consistent classifiers (Steinwart, 2002).



2.2 Relationship with kNN 

The k-nearest neighbor classification rule predicts the label of a test point $x_t$ through a
majority vote amongst its $k$ nearest neighbors. Let $\eta_j(x_t) \in \{0, 1\}$ be the neighborhood
indicator function of a test point $x_t$, where $\eta_j(x_t) = 1$ if and only if $x_j$ is one of
the $k$ nearest neighbors of $x_t$. The kNN classification rule can then be expressed as

$$h(x_t) = \mathrm{sign}\Big(\sum_{j=1}^{n} \eta_j(x_t) y_j\Big). \tag{6}$$

Superficially, the classification rule in (6) very much resembles (1). In fact, one can interpret
the SVM-RBF classification rule in (1) as a "soft" nearest neighbor rule. Instead of the zero-one
step function $\eta_j(x_t)$, the training points are weighted by $\alpha_j k(x_t, x_j)$. The
classification is still local-neighborhood based, as $k(x_t, x_j)$ decreases exponentially with
increasing distance $d(x_t, x_j)$. The SVM optimization in (2) assigns appropriate weights
$\alpha_j \ge 0$ to ensure that, on the leave-one-out training set, the majority vote is correct
for all data points by a large margin.
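To make the comparison concrete, the kNN rule (6) can be written with an explicit indicator vector. The sketch below is an illustration under simple assumptions (plain Euclidean distance, labels in {+1, -1}, odd k so that no ties occur); the function name is hypothetical.

```python
def knn_predict(train_points, labels, x_test, k):
    """Majority vote over the k nearest training points, as in (6)."""
    def sq_dist(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

    order = sorted(range(len(train_points)),
                   key=lambda j: sq_dist(train_points[j], x_test))
    neighbours = set(order[:k])
    # eta_j(x_t) is 1 for the k nearest neighbours and 0 otherwise.
    eta = [1 if j in neighbours else 0 for j in range(len(train_points))]
    vote = sum(e * yj for e, yj in zip(eta, labels))
    return 1 if vote >= 0 else -1


points = [[0.0], [1.0], [2.0], [10.0], [11.0]]
labels = [1, 1, -1, -1, -1]
near_left = knn_predict(points, labels, [0.5], k=3)
near_right = knn_predict(points, labels, [10.5], k=3)
```

Replacing the hard indicator `eta` by the real-valued weights $\alpha_j k(x_t, x_j)$ and adding the offset $b$ turns this vote into the SVM rule (1).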
3. Metric Learning



It is natural to ask if the SVM classification rule can be improved with better adjusted
metrics than the Euclidean distance. A commonly used generalization of the Euclidean
metric is the Mahalanobis metric (Mahalanobis, 1936), defined as

$$d_M(x_i, x_j) = \sqrt{(x_i - x_j)^\top M (x_i - x_j)}, \tag{7}$$

for some matrix $M \in \mathbb{R}^{d \times d}$. The matrix $M$ must be positive semi-definite
($M \succeq 0$), which is equivalent to requiring that it can be decomposed into $M = L^\top L$,
for some matrix $L \in \mathbb{R}^{r \times d}$. If $M = I^{d \times d}$, where $I^{d \times d}$
refers to the identity matrix in $\mathbb{R}^{d \times d}$, (7) reduces to the Euclidean metric.
Otherwise, it is equivalent to the Euclidean distance after the transformation $x_i \to L x_i$.
Technically, if $M = L^\top L$ is a singular matrix, the corresponding Mahalanobis distance is a
pseudo-metric (see footnote 1). Because the distinction between pseudo-metric and metric is
unimportant for this work, we refer to both as metrics. As the distance in (7) can equally be
parameterized by $L$ and $M$, we use $d_M$ and $d_L$ interchangeably.
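The equivalence between the two parameterizations of (7) can be checked numerically. The following sketch uses hypothetical helper names and an arbitrary toy matrix; it computes the distance once through $M = L^\top L$ and once as the Euclidean distance after the map $x \to Lx$.

```python
import math


def matvec(A, v):
    """Multiply matrix A (list of rows) with vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def gram(L):
    """Return M = L^T L for an r x d matrix L."""
    r, d = len(L), len(L[0])
    return [[sum(L[k][i] * L[k][j] for k in range(r)) for j in range(d)]
            for i in range(d)]


def mahalanobis_via_M(M, xi, xj):
    """d_M(x_i, x_j) = sqrt((x_i - x_j)^T M (x_i - x_j))."""
    diff = [a - b for a, b in zip(xi, xj)]
    return math.sqrt(sum(d * m for d, m in zip(diff, matvec(M, diff))))


def mahalanobis_via_L(L, xi, xj):
    """Euclidean distance between L x_i and L x_j."""
    Li, Lj = matvec(L, xi), matvec(L, xj)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(Li, Lj)))


L = [[2.0, 0.0], [1.0, 1.0]]  # toy transformation, chosen for illustration
M = gram(L)
xi, xj = [1.0, 2.0], [3.0, -1.0]
dist_M = mahalanobis_via_M(M, xi, xj)
dist_L = mahalanobis_via_L(L, xi, xj)
```

Working with $L$ rather than $M$ keeps $M$ positive semi-definite by construction, which is convenient for gradient-based learning.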

In the following sections, we introduce several approaches that focus on Mahalanobis
metric learning for k-nearest neighbor classification.



3.1 Neighborhood component analysis 



Goldberger et al. (2005) propose Neighborhood Component Analysis (NCA), which minimizes
the expected leave-one-out classification error under a probabilistic neighborhood
assignment. For each data point or query, the neighbors are drawn from a softmax probability
distribution. The probability of sampling $x_j$ as a neighbor of $x_i$ is given by:

$$p_{ij} = \begin{cases} \dfrac{e^{-d_L^2(x_i, x_j)}}{\sum_{k \neq i} e^{-d_L^2(x_i, x_k)}} & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases} \tag{8}$$

Let us define an indicator variable $y_{ij} \in \{0, 1\}$ where $y_{ij} = 1$ if and only if
$y_i = y_j$. With the probability assignment described in (8), we can easily compute the
expectation of the leave-one-out classification accuracy as

$$A_{loo} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij} y_{ij}. \tag{9}$$

NCA uses gradient ascent to maximize (9). The advantage of the probabilistic framework
over regular kNN is that (9) is a continuous, differentiable function with respect to the linear
transformation $L$. By contrast, the leave-one-out error of regular kNN is not continuous or
differentiable. The two down-sides of NCA are its relatively high computational complexity
and the non-convexity of the objective.
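A small sketch of (8) and (9) follows, assuming squared distances in the exponent and toy one-dimensional data; the function names are hypothetical and the code only evaluates the objective, it does not perform NCA's gradient ascent.

```python
import math


def nca_probabilities(points, L):
    """Soft-max neighbour probabilities p_ij from (8), with p_ii = 0."""
    proj = [[sum(l * x for l, x in zip(row, p)) for row in L] for p in points]
    n = len(points)
    P = []
    for i in range(n):
        w = [0.0 if i == j else
             math.exp(-sum((a - b) ** 2 for a, b in zip(proj[i], proj[j])))
             for j in range(n)]
        total = sum(w)
        P.append([wj / total for wj in w])
    return P


def expected_loo_accuracy(P, labels):
    """A_loo = (1/n) * sum_ij p_ij * y_ij from (9)."""
    n = len(labels)
    return sum(P[i][j] for i in range(n) for j in range(n)
               if labels[i] == labels[j]) / n


points = [[0.0], [0.1], [5.0], [5.1]]  # two well-separated toy clusters
labels = [1, 1, -1, -1]
P = nca_probabilities(points, [[1.0]])
accuracy = expected_loo_accuracy(P, labels)
```

Because every entry of `P` is a smooth function of `L`, the same quantity can be differentiated with respect to `L`, which is what NCA exploits.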

1. A pseudo-metric is not required to preserve identity, i.e. $d(x_i, x_j) = 0 \iff x_i = x_j$.

3.2 Large Margin Nearest Neighbor Classification



Large Margin Nearest Neighbor (LMNN), proposed by Weinberger et al. (2006), also mimics
the leave-one-out error of kNN. Unlike NCA, LMNN employs a convex loss function, and
encourages local neighborhoods to have the same labels by pushing data points with different
labels away and pulling those with similar labels closer. The authors introduce the concept
of target neighbors. The target neighbors of a training datum $x_i$ are the data points in the
training set that should ideally be its nearest neighbors (e.g. the closest points under the
Euclidean metric with the same class label). LMNN moves these points closer by minimizing

$$\sum_{j \rightsquigarrow i} d_M^2(x_i, x_j), \tag{10}$$

where $j \rightsquigarrow i$ indicates that $x_j$ is a target neighbor of $x_i$. In addition to the
objective (10), LMNN also enforces that no datum with a different label can be closer than a
target neighbor. In particular, let $x_i$ be a training point and $x_j$ one of its target
neighbors. Any point $x_k$ of different class membership than $x_i$ should be further away than
$x_j$ by a large margin. LMNN encodes this relationship as linear constraints with respect to $M$:

$$d_M^2(x_i, x_k) - d_M^2(x_i, x_j) \ge 1. \tag{11}$$

LMNN uses semidefinite programming to minimize (10) subject to (11). To account
for the natural limitations of a single linear transformation the authors introduce slack
variables. More explicitly, for each triple $(i, j, k)$, where $x_j$ is a target neighbor of $x_i$
and $y_k \neq y_i$, they introduce $\xi_{ijk} \ge 0$, which absorbs small violations of the
constraint (11). The resulting optimization problem can be formulated as the following
semi-definite program (SDP) (Boyd and Vandenberghe, 2004):

$$\min_{M \succeq 0} \; \sum_{j \rightsquigarrow i} d_M^2(x_i, x_j) + \mu \sum_{j \rightsquigarrow i,\; k:\, y_k \neq y_i} \xi_{ijk}$$

subject to:

(1) $d_M^2(x_i, x_k) - d_M^2(x_i, x_j) \ge 1 - \xi_{ijk}$

(2) $\xi_{ijk} \ge 0$

Here $\mu \ge 0$ defines the trade-off between minimizing the objective and penalizing constraint
violations (by default we set $\mu = 1$).
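For a fixed transformation, the value of the SDP objective can be evaluated directly, because the smallest slack that satisfies constraints (1) and (2) is $\xi_{ijk} = \max(0, 1 + d_M^2(x_i, x_j) - d_M^2(x_i, x_k))$. The sketch below evaluates the objective only (it is not an SDP solver); the function name, the parameterization through $L$ and the toy data are assumptions for illustration.

```python
def lmnn_objective(points, labels, target_neighbors, L, mu=1.0):
    """Pull term (10) plus mu times the minimal slacks for constraint (11).

    target_neighbors[i] lists the indices j that are target neighbours of i.
    """
    proj = [[sum(l * x for l, x in zip(row, p)) for row in L] for p in points]

    def d2(i, j):
        return sum((a - b) ** 2 for a, b in zip(proj[i], proj[j]))

    pull, push = 0.0, 0.0
    for i, neighbors in enumerate(target_neighbors):
        for j in neighbors:
            pull += d2(i, j)
            for k in range(len(points)):
                if labels[k] != labels[i]:
                    # Smallest slack satisfying constraints (1) and (2).
                    push += max(0.0, 1.0 + d2(i, j) - d2(i, k))
    return pull + mu * push


points = [[0.0], [1.0], [1.5]]
labels = [1, 1, -1]
targets = [[1], [0], []]  # hypothetical target-neighbour assignment
value_identity = lmnn_objective(points, labels, targets, [[1.0]])
value_scaled = lmnn_objective(points, labels, targets, [[2.0]])
```

In this toy example the differently labeled point at 1.5 sits inside the margin of the point at 1.0, so the slack term is non-zero for both transformations.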



3.3 Information-Theoretic Metric Learning 

Different from NCA and LMNN, Information-Theoretic Metric Learning (ITML), proposed
by Davis et al. (2007), does not minimize the leave-one-out error of kNN classification. In
contrast, ITML assumes a uni-modal data distribution and clusters similarly labeled inputs
close together while regularizing the learned metric to be close to some pre-defined initial
metric in terms of Gaussian cross entropy (for details see Davis et al. (2007)). Similar
to LMNN, ITML also incorporates the similarity and dissimilarity as constraints in its
optimization. Specifically, ITML enforces that similarly labeled inputs must have a distance
smaller than a given upper bound, $d_M^2(x_i, x_j) \le u$, and dissimilarly labeled points must be
further apart than a pre-defined lower bound, $d_M^2(x_i, x_j) \ge l$. If we denote the set of
similarly labeled input pairs as $S$, and dissimilar pairs as $D$, the optimization problem of
ITML is:

$$\min_{M \succeq 0} \; \mathrm{tr}(M M_0^{-1}) - \log\det(M M_0^{-1})$$

subject to:

(1) $d_M^2(x_i, x_j) \le u \quad \forall (i, j) \in S$,

(2) $d_M^2(x_i, x_j) \ge l \quad \forall (i, j) \in D$.

Davis et al. (2007) introduce several variations, including the incorporation of slack variables.
One advantage of the particular formulation of the ITML optimization problem is that the
SDP constraint $M \succeq 0$ does not have to be monitored explicitly through eigenvector
decompositions but is enforced implicitly through the objective.



Statistics           Haber  Credit ACredit   Trans  Diabts   Mammo     CMC    Page   Gamma
#examples              306     653     690     748     768     830     962    5743   19020
#features                3      15      14       4       8       5       9      10      11
#training exam.        245     522     552     599     614     664     770    4594   15216
#testing exam.          61     131     138     150     154     166     192    1149    3804

Metric            Error Rates
Euclidean            27.37   13.12   14.11   20.54   23.46   18.17   26.91    2.56   12.62
ITML                 26.50   13.68   14.71   22.86   23.14   18.20   27.67    4.78   21.50
NCA                  26.39   13.48   14.10   22.59   22.74   18.17   26.53    4.74     N/A
LMNN                 26.70   13.48   13.89   20.81   22.89   17.78   26.68    2.66   13.04

Table 1: Error rates of SVM classification with an RBF kernel (all parameters were set by
5-fold cross validation) under various learned metrics.



3.4 Metric Learning for SVM 

We evaluate the efficacy of NCA, ITML and LMNN as a pre-processing step for SVM
classification with an RBF kernel. We used nine data sets from the UCI Machine Learning
repository (Frank and Asuncion, 2010) of varying size, dimensionality and task description. The
data sets are: Haberman's Survival (Haber), Credit Approval (Credit), Australian Credit
Approval (ACredit), Blood Transfusion Service (Trans), Diabetes (Diabts), Mammographic
Mass (Mammo), Contraceptive Method Choice (CMC), Page Blocks Classification (Page)
and MAGIC Gamma Telescope (Gamma).

For simplicity, we restrict our evaluation to the binary case and convert multi-class
problems to binary ones, either by selecting the two most-difficult classes or (if those are
not known) by grouping labels into two sets. Table 1 details statistics and classification
results on all nine data sets. The best values up to statistical significance (within a 5%
confidence interval) are highlighted in bold. To be fair to all algorithms, we re-scale all
features to have standard deviation 1. We follow the commonly used heuristic for the Euclidean
metric (see footnote 2) and initialize NCA and ITML with $L_0 = \frac{1}{d} I$ for all
experiments (where $d$ denotes the number of features). As LMNN is known to be very parameter
insensitive, we set $\mu$ to the default value of $\mu = 1$. All SVM parameters ($C$ and
$\sigma^2$) were set by 5-fold cross validation on the training sets, after the metric is
learned. The results on the smaller data sets ($n < 1000$) were averaged over 200 runs with
random train/test splits, Page Blocks (Page) was averaged over 20 runs and Gamma was run once
(here the train/test splits are pre-defined).

2. The choice of $\sigma^2 = \#$features is also the default value for the LibSVM toolbox (Chang and Lin, 2001).

In terms of scalability, NCA is by far the slowest algorithm and our implementation
did not scale up to the (largest) Gamma data set. LMNN and ITML require comparable
computation time (on the order of several minutes for the small data sets and 1-2 hours for
the large ones; for details see Section 6). As a general trend, none of the three metric
learning algorithms consistently outperforms the Euclidean distance. Given the additional
computation time, it is questionable whether any of them is a reasonable pre-processing step
for SVM-RBF classification. This is in stark contrast with the drastic improvements that these
metric learning algorithms obtain when used as pre-processing for kNN (Goldberger et al., 2005;
Weinberger et al., 2006; Davis et al., 2007). One explanation for this discrepancy could be
based on the subtle but important differences between the kNN classification rule (6) and
the one of SVMs (1). In the remainder of this paper we will explore the possibility of learning
a metric explicitly for the SVM decision rule.



4. Support Vector Metric Learning 

As a first step towards learning a metric specifically for SVM classification, we incorporate
the squared Mahalanobis distance (7) into the kernel function (4) and define the resulting
kernel function and matrix as

$$k_L(x_i, x_j) = e^{-(x_i - x_j)^\top L^\top L (x_i - x_j)} \quad \text{and} \quad K_{ij} = k_L(x_i, x_j). \tag{12}$$

As mentioned before, the typical Euclidean RBF setting is a special case where $L = \frac{1}{\sigma} I^{d \times d}$.
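The special case can be verified numerically. The sketch below builds the kernel matrix of (12) and compares it with the Euclidean RBF kernel of (4) and (5) for $L = \frac{1}{\sigma} I$; function names and data are illustrative assumptions.

```python
import math


def mahalanobis_rbf_kernel(points, L):
    """K_ij = exp(-(x_i - x_j)^T L^T L (x_i - x_j)), as in (12)."""
    proj = [[sum(l * x for l, x in zip(row, p)) for row in L] for p in points]
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(pi, pj)))
             for pj in proj] for pi in proj]


def euclidean_rbf_kernel(points, sigma):
    """K_ij = exp(-||x_i - x_j||^2 / sigma^2), as in (4) and (5)."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(pi, pj)) / sigma ** 2)
             for pj in points] for pi in points]


points = [[0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
sigma = 1.5
L = [[1.0 / sigma, 0.0], [0.0, 1.0 / sigma]]  # L = (1/sigma) * I
K_L = mahalanobis_rbf_kernel(points, L)
K_E = euclidean_rbf_kernel(points, sigma)
```

Any other choice of `L` yields a kernel that weights and mixes the input features before the distance is taken.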



4.1 Loss function 

In the Euclidean case, a standard way to select the meta parameter $\sigma$ is through
cross-validation. In its simplest form, this involves splitting the training data set into two
mutually exclusive subsets: training set $T$ and validation set $V$. The SVM parameters
$\alpha_i, b$ are then trained on $T$ and the outcome is evaluated on the validation data set
$V$. After a grid search over several candidate values for $\sigma$ (and $C$), the setting that
performs best on the validation data is chosen. For a single meta parameter, search by cross
validation is simple and surprisingly effective. If more meta parameters need to be set (in the
case of choosing a matrix $L$, this involves $d \times d$ entries), the number of possible
configurations grows exponentially and the grid search becomes infeasible.

Figure 1: The function $s_a(z)$ is a soft (differentiable) approximation of the zero-one loss.
The parameter $a$ adjusts the steepness of the curve.
We follow the intuition of validating meta parameters on a hold-out set of the training
data. Ideally, we want to find a metric parameterized by $L$ that minimizes the classification
error $\mathcal{E}_V$ on the validation data set

$$L = \operatorname*{argmin}_{L} \mathcal{E}_V(L) \quad \text{where:} \quad \mathcal{E}_V(L) = \frac{1}{|V|} \sum_{(x, y) \in V} [h(x) \neq y].$$

Here $[h(x) \neq y] \in \{0, 1\}$ takes on value 1 if and only if $h(x) \neq y$. The classifier
$h(\cdot)$, defined in (1), depends on parameters $\alpha_i$ and $b$, which are re-trained for
every intermediate setting of $L$. Performing this minimization directly is non-trivial because
the $\mathrm{sign}(\cdot)$ function in (1) is non-continuous. We therefore introduce a smooth
loss function $\mathcal{L}_V$, which mimics $\mathcal{E}_V$, but is better behaved:

$$\mathcal{L}_V(L) = \frac{1}{|V|} \sum_{(x, y) \in V} s_a(y h(x)) \quad \text{where:} \quad s_a(z) = \frac{1}{1 + e^{a z}}. \tag{13}$$

The function $s_a(z)$ is the mirrored sigmoid function, a soft approximation of the zero-one
loss. The parameter $a$ adjusts the steepness of the curve. In the limit, as $a \to \infty$, the
function $\mathcal{L}_V$ becomes identical to $\mathcal{E}_V$. Figure 1 illustrates the function
$s_a(\cdot)$ for various values of $a$.
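The behavior of the surrogate can be illustrated on a handful of margins $y\,h(x)$. The sketch assumes that $h(x)$ here denotes the real-valued SVM output before the sign is taken, and the margin values are invented for illustration.

```python
import math


def s(z, a):
    """Mirrored sigmoid s_a(z) = 1 / (1 + exp(a * z))."""
    return 1.0 / (1.0 + math.exp(a * z))


def smooth_loss(margins, a):
    """Average of s_a over the validation margins, as in (13)."""
    return sum(s(m, a) for m in margins) / len(margins)


def zero_one_error(margins):
    """Fraction of validation points with a non-positive margin."""
    return sum(1 for m in margins if m <= 0) / len(margins)


margins = [1.2, 0.4, -0.3, 2.0]  # hypothetical values of y * h(x)
error = zero_one_error(margins)
losses = {a: smooth_loss(margins, a) for a in (1, 5, 50)}
```

For small `a` the loss is smooth but loose; as `a` grows, the values in `losses` approach the zero-one error, at the price of steeper and less informative gradients.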



4.2 Gradient Computation 

Our surrogate loss function $\mathcal{L}_V$ is continuous and differentiable so we can compute
the derivative $\frac{\partial \mathcal{L}_V}{\partial h(x)}$. To obtain the derivative of
$\mathcal{L}_V$ with respect to $L$ we need to complete the chain-rule and also compute
$\frac{\partial h(x)}{\partial L}$. The SVM prediction function $h(x)$, defined in (1),
depends on $L$ indirectly through $\alpha_i$, $b$ and $K$. In the next paragraph we follow the
original approach of Chapelle et al. (2002) for kernel parameter learning. This approach has also
been used successfully for wrapper-based multiple-kernel-learning (Rakotomamonjy et al.,
2008; Sonnenburg et al., 2006; Kloft et al., 2010). For ease of notation, we abbreviate
$h(x)$ by $h$ and use the vector notation $\alpha = [\alpha_1, \dots, \alpha_n]^\top$. Applying
the chain-rule to the derivative of $h$ results in:

$$\frac{\partial h}{\partial L} = \frac{\partial h}{\partial \alpha} \frac{\partial \alpha}{\partial L} + \frac{\partial h}{\partial K} \frac{\partial K}{\partial L} + \frac{\partial h}{\partial b} \frac{\partial b}{\partial L}. \tag{14}$$

The derivatives $\frac{\partial h}{\partial \alpha}$, $\frac{\partial h}{\partial K}$,
$\frac{\partial h}{\partial b}$ and $\frac{\partial K}{\partial L}$ are straight-forward and
follow from definitions (12) and (1) (Petersen and Pedersen, 2008). In order to compute
$\frac{\partial \alpha}{\partial L}$ and $\frac{\partial b}{\partial L}$, we express the vector
$(\alpha, b)$ in closed-form with respect to $L$. Because we absorb slack variables through our
kernel modification in (3) and we use a hard-margin SVM with the modified kernel, all support
vectors must lie exactly one unit from the hyperplane and satisfy

$$y_i \Big(\sum_{j} K_{ij} \alpha_j y_j + b\Big) = 1. \tag{15}$$

[Figure 2: two panels, each plotted against iterations.]

Figure 2: An example of training, validation and test error on the Credit data set. As
the loss $\mathcal{L}_V$ (left) decreases, the validation error $\mathcal{E}_V$ (right) follows
suit (solid blue lines). For visualization purposes, we did not use a second-order function
minimizer but simple gradient descent with a small step-size.

Since the parameters $\alpha_j$ of non-support vectors are zero, the derivatives of these
$\alpha_j$ with respect to $L$ are also all-zero and do not need to be factored into our
calculation. We can therefore (with a slight abuse of notation) remove all rows and columns of
$K$ that do not correspond to support vectors and express (15) as a matrix equality

$$\underbrace{\begin{bmatrix} \bar{K} & y \\ y^\top & 0 \end{bmatrix}}_{H} \begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} \mathbf{1} \\ 0 \end{bmatrix},$$

where $\bar{K}_{ij} = y_i y_j K(x_i, x_j)$. Consequently, we can solve for $\alpha$ and $b$
through left-multiplication with $H^{-1}$. Further, the derivative with respect to $L$ can be
derived from the matrix inverse rule (Petersen and Pedersen, 2008), leading to

$$(\alpha, b)^\top = H^{-1} (1, \dots, 1, 0)^\top \quad \text{and} \quad \frac{\partial (\alpha, b)^\top}{\partial L_{ij}} = -H^{-1} \frac{\partial H}{\partial L_{ij}} (\alpha, b)^\top. \tag{16}$$
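The two identities in (16) can be checked on a toy problem. The sketch below assumes two support vectors with opposite labels and a kernel that depends on a single scalar `t`, which stands in for one entry of $L$; the solver, the helper names and all numbers are illustrative assumptions. It compares the analytic derivative with a central finite difference.

```python
import math


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def build_H(K, y):
    """H = [[Kbar, y], [y^T, 0]] with Kbar_ij = y_i * y_j * K_ij."""
    n = len(y)
    H = [[y[i] * y[j] * K[i][j] for j in range(n)] + [float(y[i])]
         for i in range(n)]
    H.append([float(v) for v in y] + [0.0])
    return H


C, t = 10.0, 0.7  # hypothetical regularization constant and parameter value
y = [1, -1]


def kernel(t):
    """Toy 2 x 2 kernel with the 1/C diagonal shift from (3)."""
    off = math.exp(-t)
    return [[1.0 + 1.0 / C, off], [off, 1.0 + 1.0 / C]]


def solution(t):
    """(alpha_1, alpha_2, b) = H^{-1} (1, 1, 0)^T."""
    return solve(build_H(kernel(t), y), [1.0, 1.0, 0.0])


ab = solution(t)
# dH/dt: only the off-diagonal kernel entries depend on t.
dK = [[0.0, -math.exp(-t)], [-math.exp(-t), 0.0]]
dH = [[y[i] * y[j] * dK[i][j] for j in range(2)] + [0.0] for i in range(2)]
dH.append([0.0, 0.0, 0.0])
rhs = [-sum(dH[i][j] * ab[j] for j in range(3)) for i in range(3)]
analytic = solve(build_H(kernel(t), y), rhs)  # -H^{-1} (dH/dt) (alpha, b)^T
eps = 1e-6
numeric = [(p - m) / (2 * eps)
           for p, m in zip(solution(t + eps), solution(t - eps))]
```

In an actual implementation the same computation would be repeated for every entry of $L$, with $H$ restricted to the current support vectors.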



4.3 Optimization 



Because the derivative $\frac{\partial H}{\partial L}$ follows directly from the definition of
$\bar{K}$ and (12), this completes the gradient $\frac{\partial \mathcal{L}_V}{\partial L}$. We
can now use standard gradient descent, or second order methods, to minimize (13) up to a local
minimum. It is important to point out that (16) requires the computation of the optimal
$\alpha, b$, given the current matrix $L$. These can be obtained with any one of the many freely
available SVM packages (Chang and Lin, 2001) by solving the SVM optimization (2) for the kernel
$K$ that results from $L$. In addition, we also learn the regularization constant $C$ from
eq. (3) with our gradient descent optimization. For brevity we omit the exact derivation of
$\frac{\partial \mathcal{L}_V}{\partial C}$ but point out that it is very similar to the gradient
with respect to $L$, except that it is computed only from the diagonal entries of $K$.

We control the steps of gradient descent by early-stopping. We use part of the training
data as a small hold-out set to monitor the algorithm's performance, and we stop the
gradient descent when the validation results cease to improve.
We refer to our algorithm as Support Vector Metric Learning (SVML). Algorithm 1
summarizes SVML in pseudo-code. Figure 2 illustrates the value of the loss function
$\mathcal{L}_V$ as well as the training, validation and test errors.

Algorithm 1 SVML in pseudo-code.

Initialize L.
while hold-out set result keeps improving do
    Compute kernel matrix K from L as in (12).
    Call SVM with K to obtain α and b.
    Compute gradient as in (16) and perform update on L.
end while
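The control flow of Algorithm 1 can be sketched as a generic loop. This is a skeleton under stated assumptions, not the authors' implementation: `gradient_fn` and `holdout_error_fn` are hypothetical callables (the first would retrain the SVM for the current $L$ and return the gradient built from (16), the second would report the hold-out error), and plain gradient descent with a fixed step replaces whatever minimizer is used in practice.

```python
def svml_outer_loop(L0, gradient_fn, holdout_error_fn, step=0.1, max_iter=100):
    """Gradient descent on L with early stopping on a hold-out set."""
    best_L, best_err = L0, holdout_error_fn(L0)
    L = L0
    for _ in range(max_iter):
        G = gradient_fn(L)  # would wrap kernel computation and the SVM call
        L = [[l - step * g for l, g in zip(l_row, g_row)]
             for l_row, g_row in zip(L, G)]
        err = holdout_error_fn(L)
        if err >= best_err:  # hold-out result stopped improving
            break
        best_L, best_err = L, err
    return best_L


# Toy stand-ins: a quadratic bowl with its minimum at the all-ones matrix.
toy_error = lambda L: sum((l - 1.0) ** 2 for row in L for l in row)
toy_gradient = lambda L: [[2.0 * (l - 1.0) for l in row] for row in L]
L_final = svml_outer_loop([[0.0, 0.0], [0.0, 0.0]], toy_gradient, toy_error)
```

Returning the best matrix seen so far, rather than the last iterate, keeps the result from the step just before the hold-out error stopped improving.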



4.4 Regularization and Variations 

In total, we learn $d \times d$ parameters for the matrix $L$ and $n + 1$ parameters for
$\alpha$ and $b$. To avoid overfitting, we add a regularization term to the loss function, which
restricts the matrix $L$ from deviating too much from its initial estimate $L_0$:

$$\mathcal{L}_V(L) = \frac{1}{|V|} \sum_{(x, y) \in V} s_a(y h(x)) + \lambda \|L - L_0\|_F^2. \tag{17}$$

Another way to avoid overfitting is to impose structural restrictions on the matrix $L$.
If $L$ is restricted to be spherical, $L = \frac{1}{\sigma} I^{d \times d}$, SVML reduces to
kernel width estimation. Alternatively, one can restrict $L$ to be any diagonal matrix,
essentially performing feature re-weighting. This can also be useful as a method for feature
selection in settings with noisy features (Weston et al., 2001). We refer to these two settings
as SVML-Sphere and SVML-Diag. Both of these special scenarios have been studied in previous work
in the context of kernel parameter estimation (Ayat et al., 2005; Chapelle et al., 2002). See
Section 6 for a discussion on related work.

Another interesting structural limitation is to enforce $L \in \mathbb{R}^{r \times d}$ to be
rectangular, by setting $r < d$. This can be particularly useful for data visualization. For
high dimensional data, the decision boundary of support vector machines is often hard to
conceptualize. By setting $r = 2$ or $r = 3$, the data is mapped into a low dimensional space and
can easily be plotted.
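The regularizer in (17) and the three structural variants can be written down compactly. The helper names and numeric values below are hypothetical and only illustrate the shapes involved.

```python
def frobenius_penalty(L, L0, lam):
    """lam * ||L - L0||_F^2, the regularization term in (17)."""
    return lam * sum((a - b) ** 2
                     for row_a, row_b in zip(L, L0)
                     for a, b in zip(row_a, row_b))


def spherical(scale, d):
    """SVML-Sphere: L = scale * I, a single free parameter."""
    return [[scale if i == j else 0.0 for j in range(d)] for i in range(d)]


def diagonal(weights):
    """SVML-Diag: one weight per feature."""
    d = len(weights)
    return [[weights[i] if i == j else 0.0 for j in range(d)] for i in range(d)]


def project(L, x):
    """Map x to Lx; an r x d matrix with r < d reduces the dimensionality."""
    return [sum(l * xi for l, xi in zip(row, x)) for row in L]


L0 = spherical(0.5, 3)
L_diag = diagonal([0.5, 1.5, 0.0])  # a zero weight switches a feature off
penalty = frobenius_penalty(L_diag, L0, lam=10.0)
L_rect = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # r = 2, d = 3
point_2d = project(L_rect, [3.0, 1.0, 2.0])
```

The spherical and diagonal variants have 1 and $d$ free parameters respectively, compared with $r \times d$ for a full or rectangular matrix.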



4.5 Implementation 

The gradient, as described in this section, can be computed very efficiently. We use a
simple C/Mex implementation with Matlab. As our SVM solver, we use the open-source
Newton-Raphson implementation from Olivier Chapelle (see footnote 3). As function minimizer
we use an open-source implementation of conjugate gradient descent (see footnote 4). Profiling
of our code reveals that over 95% of the gradient computation time was spent calling the SVM
solver. For a large-scale implementation, one could use special purpose SVM solvers that are
optimized for speed (Bottou et al., 2007; Joachims, 1998). Also, the only computationally
intensive parts of the gradient outside of the SVM calls are all trivially parallelizable and
could be computed on multiple cores or graphics cards. However, as it is beside the point of
this paper, we do not focus on further scalability.

3. Available at http://olivier.chapelle.cc/primal/.

4. Courtesy of Carl Edward Rasmussen, available from http://www.gatsby.ucl.ac.uk/~edward/code/minimize/minimize.m



5. Results 



To evaluate SVML, we revisit the nine data sets from Section 3.4. For convenience, Table 2
restates all relevant data statistics and also includes error rates for all metric
learning algorithms. SVML is naturally slower than SVM with Euclidean distance but
requires no cross validation for any meta parameters. For better comparison, we also include
results for 1-fold and 5-fold cross validation for all other algorithms. In both cases, the meta
parameters $\sigma^2, C$ were selected from five candidates each, resulting in 25 or 125 SVM
executions. The kernel width $\sigma^2$ is selected from within the set
$\{4d, 2d, d, \frac{d}{2}, \frac{d}{4}\}$ and the meta parameter $C$ was chosen from within
$\{0.1, 1, 10, 100\}$. As SVML is not particularly sensitive to the exact choice of $\lambda$
(the regularization parameter in (17)), we set it to 100 for the smaller data sets ($n < 1000$)
and to 10 for the larger ones (Page, Gamma). We terminate our algorithm based on a small
hold-out set.



Statistics             Haber  Credit ACredit   Trans  Diabts   Mammo     CMC    Page   Gamma
#examples                306     653     690     748     768     830     962    5743   19020
#features                  3      15      14       4       8       5       9      10      11
#training exam.          245     522     552     599     614     664     770    4594   15216
#testing exam.            61     131     138     150     154     166     192    1149    3804

Metric              Error Rates
Euclidean 1-fold       27.16   13.16   14.36   21.05   23.84   18.43   27.12    2.61   12.70
Euclidean 3-fold       27.40   13.10   14.13   20.58   23.39   18.27   26.77    2.55   12.68
Euclidean 5-fold       27.37   13.12   14.11   20.54   23.46   18.17   26.91    2.56   12.62
ITML + SVM 1-fold      26.57   13.78   14.15   23.01   23.19   19.14   28.65    4.82   22.63
ITML + SVM 3-fold      26.13   13.58   13.88   22.98   23.17   17.98   27.68    4.77   21.50
ITML + SVM 5-fold      26.50   13.68   14.71   22.86   23.14   18.20   27.67    4.78   21.50
NCA + SVM 1-fold       26.44   13.74   14.14   22.89   22.84   17.76   27.47    4.73     N/A
NCA + SVM 3-fold       26.47   13.45   14.00   22.67   22.72   18.12   26.60    4.73     N/A
NCA + SVM 5-fold       26.39   13.48   14.10   22.59   22.74   18.17   26.53    4.74     N/A
LMNN + SVM 1-fold      26.38   13.11   13.97   21.02   22.97   17.84   26.80    2.85   13.04
LMNN + SVM 3-fold      26.44   13.30   13.93   20.73   22.86   17.57   26.66    2.81   12.79
LMNN + SVM 5-fold      26.70   13.48   13.89   20.81   22.89   17.78   26.68    2.66   13.04
SVML-Sphere            27.42   13.43   13.78   20.26   23.24   17.81   28.23    3.61   12.70
SVML-Diag              28.15   13.33   15.11   20.46   24.14   17.35   29.51    2.92   12.54
SVML                   25.99   12.83   13.92   20.89   23.25   17.57   26.34    3.41   12.54



Table 2: Statistics and error rates for all data sets. The data sets are sorted from smallest
to largest (left to right). The table shows statistics of the data sets and error rates
of SVML and the comparison algorithms. The best results (up to a 5% confidence
interval) are highlighted in bold.

Figure 3: Timing results on all data sets. The timing includes metric learning, SVM training
and cross validation. The computational resources for SVML training are roughly 
comparable with 3-5 fold cross validation with a Euclidean metric. (NCA did not 
scale to the Gamma data set.) 



As in Section 2, experimental results are obtained by averaging over multiple runs on
randomly generated 80/20 splits of each data set. For small data sets, we average 200 splits,
20 for medium size, and 1 for the large data set Gamma (where train/test splits are
pre-defined). For the SVML training, we further apply a 50/50 split for training and validation
within the training set, and another 50/50 split on the validation set for early stopping.
The results from SVML appeared fairly insensitive to these splits.
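
The nested splits described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `split_indices` and `svml_splits` are hypothetical, and we assume that the second 50/50 split divides the validation portion into one half for the metric's validation loss and one half for early stopping.

```python
import random

def split_indices(indices, first_fraction, rng):
    """Shuffle the indices and split them into two disjoint lists."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = int(round(first_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def svml_splits(n_examples, seed=0):
    """Return the index sets for one run (hypothetical helper)."""
    rng = random.Random(seed)
    # 80/20 split into a training portion and a test portion.
    train_all, test = split_indices(range(n_examples), 0.8, rng)
    # 50/50 split of the training portion: SVM training vs. validation.
    svm_train, validation = split_indices(train_all, 0.5, rng)
    # 50/50 split of the validation portion (assumed reading of the text):
    # one half drives the metric update, the other is held out for early stopping.
    metric_val, early_stop = split_indices(validation, 0.5, rng)
    return {"svm_train": svm_train, "metric_val": metric_val,
            "early_stop": early_stop, "test": test}

splits = svml_splits(306)  # 306 examples, as in the Haber data set
print({name: len(idx) for name, idx in splits.items()})
```

Repeating `svml_splits` with different seeds and averaging the resulting test errors would correspond to the averaging over random splits described in the text.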

As a general trend, SVML with a full matrix obtains the best results (up to significance) 
on 6 out of the 9 data sets. It is the only metric that consistently outperforms Euclidean 
distances. The diagonal version SVML-Diag and the spherical version SVML-Sphere each
obtain the best results on 2 out of 9 data sets and are not better than the uninformed
Euclidean distance with 5-fold cross validation. None of the kNN metric learning algorithms
performs comparably.

In general, we found the time required for SVML training to be roughly between 3-fold
and 5-fold cross validation for Euclidean metrics, usually outperforming LMNN, ITML
and NCA. Figure 3 provides running-time details on all data sets. We consider the small
additional time required for SVML over Euclidean distances with cross validation as highly 
encouraging. 

5.1 Dimensionality Reduction. 

In addition to better classification results, SVML can also be used to map data into a
low dimensional space while learning the SVM, allowing effective visualizations of SVM
decision boundaries even for high dimensional data. To evaluate the capabilities of our
algorithm for dimensionality reduction and visualization, we restrict L to be rectangular,
specifically a mapping into an r = 2 or r = 3 dimensional space. As comparison, we
use PCA to reduce the dimensionality before the SVM training without SVML (all meta
parameters were set by cross-validation). Figure 4 shows the visualization of the support
vectors of the Credit data set after a mapping into a two dimensional space with SVML
and PCA. The background is colored by the prediction function h(·). The 2D visualization
learned by SVML shows a much more interpretable decision boundary. (Visualizations of
the LMNN and NCA mappings were very similar to those of PCA.) Visualizing the support
vectors and the decision boundaries of kernelized SVMs can help demystify hyperplanes in
reproducing kernel Hilbert spaces and might help with data analysis.

[Figure 4 panels: PCA, SVML-2d]

Figure 4: 2D visualization of the Credit data set. The figure shows the decision surface
and support vectors generated by SVML (L ∈ R^{2×d}) and standard SVM after
projection onto the two leading principal components.
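
To make the rectangular mapping concrete, the following sketch projects inputs with an r × d matrix L and evaluates an RBF kernel on the projected points. It is an illustration under assumptions: the kernel form exp(-||Lx - Lz||²) with the kernel width absorbed into L, the function names, and the example matrix are ours and are not taken from the SVML implementation.

```python
import math

def project(L, x):
    """Map a d-dimensional input x to r dimensions via the r x d matrix L."""
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in L]

def rbf_kernel(L, x, z):
    """RBF kernel on projected inputs: exp(-||Lx - Lz||^2) (assumed form)."""
    px, pz = project(L, x), project(L, z)
    sq_dist = sum((a - b) ** 2 for a, b in zip(px, pz))
    return math.exp(-sq_dist)

# Hypothetical 2 x 4 matrix (r = 2, d = 4); in SVML, L would be learned.
L = [[1.0, 0.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, 0.5]]
x = [1.0, 2.0, 0.0, 2.0]
z = [1.0, 2.0, 0.0, 0.0]

print(project(L, x))        # 2D coordinates that can be plotted directly
print(rbf_kernel(L, x, z))  # kernel value computed in the 2D space
```

Because the kernel depends on the inputs only through Lx, the decision function can be evaluated on a grid in the r-dimensional space, which is what makes a plot of the decision surface possible for r = 2.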



6. Related Work 

Multiple publications introduce methods to learn Mahalanobis metrics. Previous work has
focussed primarily on Mahalanobis metrics for k-nearest neighbor classifiers (Davis et al.
2007; Globerson and Roweis 2005; Goldberger et al. 2005; Shental et al. 2002; Shalev-Shwartz
et al. 2004; Weinberger et al. 2006) and clustering (Davis et al. 2007; Shalev-Shwartz et al.
2004; Shental et al. 2002; Xing et al. 2002). None of these algorithms is
specifically geared towards SVM classification. A detailed discussion of NCA, ITML and
LMNN is provided in Section 3.

Another related line of work focusses on learning of the kernel matrix. The most common
approach is to find convex combinations of already existing kernel matrices (Bach et al.
2004; Lanckriet et al. 2004) or kernel learning through semi-definite programming (Graepel
2002; Ong et al. 2005). The most similar area of related work is the field of kernel
parameter estimation (Ayat et al. 2005; Chapelle et al. 2002; Cherkassky and Ma 2004;
Friedrichs and Igel 2005). In particular, (Friedrichs and Igel 2005) can be viewed as
learning a Mahalanobis metric for the Gaussian kernel; however, instead of minimizing a soft
surrogate of the validation error with gradient descent, the authors use genetic programming
to maximize the "fitness" of the kernel parameters. The method of (Chapelle et al. 2002)
uses gradient descent to learn the σ parameter of the RBF kernel matrix. SVML was highly
inspired by this work. The main difference between our work and (Chapelle et al. 2002) is
that SVML learns the full matrix L, and therefore a Mahalanobis metric, whereas Chapelle
et al. only learn the parameter σ or individual weights for blocks of features. Spherical and
diagonal SVML can be viewed as a version of (Chapelle et al. 2002). Similarly, (Ayat et al.
2005; Schittkowski 2005) also explore feature re-weighting for support vector machines with
alternative loss functions.
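
The relation between the three SVML variants can be illustrated with a short sketch of the squared distance ||L(x - z)||² under different restrictions on L. The helper names and numbers are hypothetical. We assume that the spherical variant corresponds to a single scale shared by all features (analogous to learning only σ) and the diagonal variant to one weight per feature.

```python
def spherical(d, scale):
    """L = scale * I: one shared scale, analogous to learning only sigma."""
    return [[scale if i == j else 0.0 for j in range(d)] for i in range(d)]

def diagonal(weights):
    """L = diag(weights): one weight per feature (feature re-weighting)."""
    d = len(weights)
    return [[weights[i] if i == j else 0.0 for j in range(d)] for i in range(d)]

def sq_mahalanobis(L, x, z):
    """Squared distance ||L(x - z)||^2 induced by the matrix L."""
    diff = [a - b for a, b in zip(x, z)]
    mapped = [sum(l * v for l, v in zip(row, diff)) for row in L]
    return sum(m * m for m in mapped)

x, z = [1.0, 2.0], [0.0, 0.0]
L_sphere = spherical(2, 0.5)       # rescales the Euclidean distance
L_diag = diagonal([1.0, 0.5])      # weights each feature separately
L_full = [[1.0, 0.5], [0.0, 1.0]]  # also mixes features (full Mahalanobis metric)

for name, L in [("sphere", L_sphere), ("diag", L_diag), ("full", L_full)]:
    print(name, sq_mahalanobis(L, x, z))
```

Only the full matrix can express correlations between features, which is the additional freedom that distinguishes full SVML from the spherical and diagonal variants.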



7. Conclusion 

In this paper we investigate metric learning for SVMs. An empirical study of three of the 
most widely used out-of-the-box metric learning algorithms for kNN classification shows 
that these are not particularly well suited for SVMs. As an alternative, we derive SVML, 
an algorithm that seamlessly combines support vector classification with distance metric 
learning. SVML learns a metric that attempts to minimize the validation error of the 
SVM prediction at the same time as it trains the SVM classifier. On several standard 
benchmark datasets we demonstrate that our algorithm achieves state-of-the-art results 
with very high reliability. An important feature of SVML is that it is very insensitive to
its few parameters (all of which we set to default values) and does not require any model
selection by cross validation. In fact, we demonstrate that SVML outperforms traditional 
SVM-RBF with the Euclidean distance (where parameters are set through cross validation) 
consistently in accuracy while requiring a comparable amount of computation time. These 
aspects make SVML a very promising general-purpose metric learning algorithm for SVMs 
with RBF kernels, which also incorporates automatic model selection. We are currently
implementing an open-source plug-in for the popular LIBSVM library (Chang and Lin
2001) and extending it to multi-class settings.